There has been a rapid uptake in the use of non-linear dimensionality reduction (NLDR) methods such as t-distributed stochastic neighbour embedding (t-SNE) in the natural sciences as part of cluster orientation and dimension reduction workflows. The appropriate use of these methods is made difficult by their complex parameterisations and the multitude of decisions required to balance the preservation of local and global structure in the resulting visualisation. We present visual diagnostics for the pragmatic usage of NLDR methods by combining them with a technique called the tour. A tour is a sequence of interpolated linear projections of multivariate data onto a lower dimensional space. The sequence is displayed as a dynamic visualisation, allowing a user to see the shadows the high-dimensional data casts in a lower dimensional view. By linking the tour to a view obtained from an NLDR method, we can preserve global structure and, through user interactions like linked brushing, observe where the NLDR view may be misleading. We present several case studies, using both simulated data and real data from single cell transcriptomics, that show our approach is useful for cluster orientation tasks. The implementation of our framework is available as an R package called liminal, available at https://github.com/sa-lee/liminal.
High dimensional data is increasingly prevalent in the natural sciences and beyond, but presents a challenge to the analyst in terms of data cleaning, pre-processing and visualisation. Methods to embed data from a high-dimensional space into a low-dimensional one now form a core step of the data analysis workflow, where they are used to ascertain hidden structure and de-noise data for downstream analysis.
Choosing an appropriate embedding presents a challenge to the analyst. How does an analyst know whether the embedding has captured the underlying topology and geometry of the high dimensional space? The answer depends on the analyst’s workflow. Brehmer et al. (2014) characterised two main workflow steps that an analyst performs when using embedding techniques: dimension reduction and cluster orientation. The first relates to dimension reduction achieved by using an embedding method: the analyst wants to characterise and map meaning onto the embedded form, for example identifying batch effects from a high-throughput sequencing experiment, or identifying a gradient or trajectory along the embedded form (Nguyen and Holmes 2019). The second relates to using embeddings as part of a clustering workflow: here analysts are interested in identifying and naming clusters, and verifying them by either applying known labels or colouring by variables that are known a priori to distinguish clusters. Both of these workflow steps rely on the embedding being representative of the original high-dimensional dataset, and become much more difficult when there is no underlying ground truth.
As part of a visualisation workflow, it is important to consider the perception and interpretation of embedding methods as well. Sedlmair, Munzner, and Tory (2013) showed that scatter plots were mostly sufficient for detecting class separation; however, they also noted that multiple embeddings were often required. For the task of cluster identification, Lewis, Van der Maaten, and Sa (2012) showed experimentally that novice users of non-linear embedding techniques were more likely to consider clusters of points on a scatter plot to be the result of a spurious embedding than advanced users who were aware of the inner workings of the embedding algorithm.
A complementary approach for visualizing structure in high dimensional data is the tour. A tour is a sequence of projections of a high dimensional dataset onto a low-dimensional basis matrix, and is represented as an animated visualization (Asimov 1985; Buja and Asimov 1986). Given the dynamic nature of the tour, user interaction is important for controlling and exploring the visualisation: the tour has been used previously by Wickham, Cook, and Hofmann (2015) for exploring statistical model fits and by Buja, Cook, and Swayne (1996) for exploring the space of factorial experimental designs.
While there has been much work on the algorithmic details of embedding methods, there are relatively few tools designed to assist users in interacting with these techniques: when is an embedding sufficient for the task at hand? Several interactive interfaces have been proposed for evaluating or using embedding techniques. Buja et al. (2008) used tours to guide analysts during the optimisation of multidimensional scaling methods by extending their interactive visualisation software XGobi and GGobi into a new tool called GGvis (Swayne, Cook, and Buja 1998; Swayne et al. 2003; Swayne and Buja 2004). Their interface allows the analyst to dynamically modify and check whether an MDS configuration has preserved the locality and closeness of points between the configuration and the original data. Ovchinnikova and Anders (2020) created the Sleepwalk interface for checking non-linear embeddings in single cell RNA-seq data. It provides a click-and-highlight visualisation for colouring points in an embedding according to an estimated pairwise distance in the original high-dimensional space. Similarly, the TensorFlow embedding projector is a web interface for running some non-linear embedding methods live in the browser, and provides interactions to colour points and select nearest neighbours (Smilkov et al. 2016). Finally, the work by Pezzotti et al. (2017) provides a user-guided and modified form of the t-SNE algorithm that allows users to modify optimisation parameters in real time.
There is no one-size-fits-all solution: finding an appropriate embedding for a given dataset is a difficult and somewhat poorly defined problem. For non-linear methods, there are many parameters to explore that can affect the resulting visualisation and its interpretation. Interfaces for evaluating embeddings require interaction, but should also be able to be incorporated into an analyst's workflow. We take a pragmatic approach: by combining interactive graphics and tours with embeddings, we allow users to see a global overview of their high dimensional data and assist them with cluster orientation tasks.
The rest of the paper is organised as follows. The next section provides background on dimension reduction methods, including an overview of the tour. Then we describe the visual design of liminal, followed by implementation details. Next we provide case studies that show how our interface assists in using embedding methods. Finally, we describe the insights gained by using liminal and plans for extensions to the software.
In the following section, we provide a brief overview of the goals of DR methods with respect to the data analyst, in addition to their high-level mathematical details. Here we restrict our attention to two recent methods that are commonly used in the literature: t-distributed stochastic neighbour embedding (t-SNE) and uniform manifold approximation and projection (UMAP); however, we also mention several other techniques (Maaten and Hinton 2008; McInnes, Healy, and Melville 2018).
To begin, we suppose the data is in the form of a rectangular matrix of real numbers, \(X = [\mathbf{x_1}, \dots, \mathbf{x_n}]\), where \(n\) is the number of observations in \(p\) dimensions. The purpose of any DR algorithm is to find a low-dimensional representation of the data, \(Y = [\mathbf{y_1}, \dots, \mathbf{y_n}]\), such that \(Y\) is an \(n \times d\) matrix where \(d \ll p\). The hope of the analyst is that the DR procedure that produces \(Y\) will remove noise in the original dataset while retaining any latent structure.
DR methods can be classified into two broad classes: linear and non-linear methods. Linear methods perform a linear transformation of the data, that is, \(Y\) is a linear transformation of \(X\); one example is principal components analysis (PCA) which performs an eigendecomposition of the estimated sample covariance matrix. The eigenvalues are sorted in decreasing order and represent the variance explained by each component (eigenvector). A common approach to deciding on the number of principal components to retain is to plot the proportion of variance explained by each component and choose a cut-off.
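As a concrete illustration of the linear case, the following is a short sketch in Python with numpy; the toy data, and the choice to keep the top two components, are our own for illustration, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 observations in 10 dimensions whose variance is concentrated
# in 2 latent directions, plus a little isotropic noise.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))

# PCA via eigendecomposition of the estimated sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

prop_var = eigvals / eigvals.sum()     # proportion of variance per component
Y = Xc @ eigvecs[:, :2]                # linear embedding onto the first two PCs
```

Plotting `prop_var` against the component index gives the scree plot from which a cut-off is usually chosen.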
For non-linear methods \(Y\) is generated via a pre-processed form of the input \(X\) such as the \(k\)-nearest neighbours graph or via a kernel transformation. Multidimensional scaling (MDS) is a class of DR method that aims to construct an embedding \(Y\) such that the pair-wise distances (inner products) in \(Y\) approximate the pair-wise distances (inner products) in \(X\) (Torgerson 1952; Kruskal 1964a). There are many variants of MDS, such as non-metric scaling which amounts to replacing distances with ranks instead (Kruskal 1964b). A related technique is Isomap which uses a \(k\)-nearest neighbour graph to estimate the pair-wise geodesic distance of points in \(X\) then uses classical MDS to construct \(Y\) (Silva and Tenenbaum 2003). Other approaches are based on diffusion processes such as diffusion maps (Coifman et al. 2005). A recent example of this approach is the PHATE algorithm (Moon et al. 2019). Here an affinity matrix is estimated via the pair-wise distance matrix and k-nearest neighbours graph of \(X\). The algorithm de-noises estimated distances in high-dimensional space via transforming the affinity matrix into a Markov transition probability matrix and diffusing this matrix over a fixed number of time steps. Then the diffused probabilities are transformed once more to construct a distance matrix, and classical MDS is used to generate \(Y\). A general difficulty with using non-linear DR methods for exploratory data analysis is selecting and tuning appropriate parameters. To see the extent of these choices we now examine the underpinnings of t-SNE and UMAP.
The t-SNE algorithm estimates the pair-wise similarity of (Euclidean) distances of points in a high dimensional space using a Gaussian distribution, and then estimates a configuration in the low-dimensional embedding space by modelling similarities using a t-distribution with one degree of freedom (Maaten and Hinton 2008). There are several subtleties to the use of the algorithm that are revealed by stepping through its machinery.
To begin, t-SNE transforms pair-wise distances between \(\mathbf{x_i}\) and \(\mathbf{x_j}\) to similarities using a Gaussian kernel:
\[ p_{j|i} = \frac{\exp(-\lVert \mathbf{x_i - x_j} \rVert ^ 2 / 2\sigma_i^2)}{\sum_{k \ne i}\exp(-\lVert \mathbf{x_i - x_k} \rVert ^ 2 / 2\sigma_i^2)} \]
The conditional probabilities are then normalised and symmetrised to form a joint probability distribution via averaging:
\[ p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n} \]
The variance parameter of the Gaussian kernel is controlled by the analyst using a fixed value of perplexity for all observations:
\[ \text{perplexity}_i = \exp\left(-\log(2) \sum_{j \ne i}p_{j|i}\log_2 p_{j|i}\right) \] As the perplexity increases, so does \(\sigma^2_{i}\); the perplexity is bounded above by the number of other observations, \(n-1\), which corresponds to \(\sigma^2_{i} \rightarrow \infty\). This essentially turns t-SNE into a nearest neighbours algorithm: \(p_{j|i}\) will be close to zero for all observations that are not in the \(\mathcal{O}(\text{perplexity}_i)\) neighbourhood graph of the \(i\)th observation (Maaten 2014).
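To make the role of perplexity concrete, the bandwidth \(\sigma_i\) is typically found by binary search so that the conditional distribution attains the analyst's target perplexity. The following is a minimal Python sketch of that calibration for a single observation; it is our own illustration, not the Rtsne implementation.

```python
import numpy as np

def conditional_probs(d2, sigma):
    """p_{j|i} for one observation i, given squared distances to the others."""
    w = np.exp(-(d2 - d2.min()) / (2.0 * sigma**2))   # shift exponent for numerical stability
    return w / w.sum()

def perplexity(p):
    """Perplexity = 2^H(P_i), with H the Shannon entropy in bits."""
    h = -np.sum(p * np.log2(p + 1e-12))
    return 2.0 ** h

def sigma_for_perplexity(d2, target, lo=1e-3, hi=1e3, iters=60):
    """Binary search for the bandwidth sigma_i matching a target perplexity."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if perplexity(conditional_probs(d2, mid)) < target:
            lo = mid     # distribution too peaked: widen the kernel
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 5))
d2 = np.sum((x[0] - x[1:])**2, axis=1)   # squared distances from observation 0
sigma = sigma_for_perplexity(d2, target=10.0)
p = conditional_probs(d2, sigma)
```

In a full implementation this search is repeated for every observation, giving each point its own \(\sigma_i\) but a common perplexity.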
Next, in the target low-dimensional space \(Y\), pair-wise distances between \(\mathbf{y_i}\) and \(\mathbf{y_j}\) are modelled as a symmetric probability distribution using a t-distribution with one degree of freedom (a Cauchy kernel):
\[ q_{ij} = \frac{w_{ij}}{Z} \\ \text{where } w_{ij} = \frac{1}{ 1 + \lVert \mathbf{y_i - y_j} \rVert ^ 2} \text{ and } Z = \sum_{k \ne l} w_{kl}. \]
The resulting embedding \(Y\) is the one that minimizes the Kullback-Leibler divergence between the probability distributions formed via similarities of observations in \(X\), \(\mathcal{P}\) and similarities of observations in \(Y\), \(\mathcal{Q}\):
\[ \mathcal{L(\mathcal{P}, \mathcal{Q})} = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}\]
Re-writing the loss function in terms of attractive (first term) and repulsive (second term) forces, we obtain:
\[ \mathcal{L(\mathcal{P}, \mathcal{Q})} = -\sum_{i \ne j} p_{ij}\log w_{ij} + \log\sum_{i \ne j} w_{ij} \]
When the loss function is minimised, the attractive term is small: pair-wise distances in \(Y\) are small wherever \(p_{ij}\) is non-zero, that is, wherever \(\mathbf{x_i}\) and \(\mathbf{x_j}\) are close together. The repulsive term must also be small when the loss function is minimised, which pushes the remaining pair-wise distances in \(Y\) to be large, regardless of the magnitude of the corresponding distances in \(X\).
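The equivalence of the two forms of the loss, up to the additive constant \(\sum p_{ij}\log p_{ij}\) (which does not depend on \(Y\)), can be checked numerically. This Python sketch is our own illustration; for brevity it uses a fixed Gaussian bandwidth rather than a perplexity calibration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
# Toy high-dimensional data and an arbitrary low-dimensional configuration.
X = rng.normal(size=(n, 10))
Y = rng.normal(size=(n, 2))

def tsne_P(X, sigma=1.0):
    """Symmetrised joint probabilities p_ij with a fixed bandwidth."""
    d2 = np.sum((X[:, None] - X[None])**2, axis=-1)
    w = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(w, 0.0)
    p_cond = w / w.sum(axis=1, keepdims=True)   # p_{j|i}
    return (p_cond + p_cond.T) / (2 * n)        # averaged joint distribution

def tsne_W(Y):
    """Unnormalised low-dimensional similarities w_ij (Cauchy kernel)."""
    d2 = np.sum((Y[:, None] - Y[None])**2, axis=-1)
    w = 1.0 / (1.0 + d2)
    np.fill_diagonal(w, 0.0)
    return w

P, W = tsne_P(X), tsne_W(Y)
Q = W / W.sum()                                  # q_ij = w_ij / Z
mask = ~np.eye(n, dtype=bool)

kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
# Attractive + repulsive decomposition, plus the constant sum p log p:
decomposed = (-np.sum(P[mask] * np.log(W[mask])) + np.log(W.sum())
              + np.sum(P[mask] * np.log(P[mask])))
```

The two quantities agree to floating-point precision, confirming that optimising \(Y\) only trades off the attractive and repulsive terms.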
Taken together, these details reveal the sheer number of decisions that an analyst must make. How does one choose the perplexity? How should the parameters that control the optimisation of the loss function (done with stochastic gradient descent), such as the number of iterations, the early exaggeration (a multiplier of the attractive force at the beginning of the optimisation), or the learning rate, be selected? It is a known problem that t-SNE can have trouble recovering topology, and that configurations can be highly dependent on how the algorithm is initialised and parameterised (Wattenberg, Viégas, and Johnson 2016; Kobak and Berens 2019; Melville 2020). If the goal is cluster orientation, a recent theoretical contribution by Linderman and Steinerberger (2019) proved that t-SNE can recover spherical and well separated cluster shapes, and proposed new approaches for tuning the optimisation parameters. However, the cluster sizes and their relative orientation in a t-SNE view can be misleading perceptually, due to the algorithm's emphasis on locality.
Another recent method, UMAP, has seen a large rise in popularity (at least in single cell transcriptomics) (McInnes, Healy, and Melville 2018). It is related to LargeVis (Tang et al. 2016), and like t-SNE acts on the k-nearest neighbour graph. Its main differences are that it uses a different cost function (cross entropy), which is optimised using stochastic gradient descent, and that it defines a different kernel for similarities in the low-dimensional space. Due to its computational speed it is possible to generate UMAP embeddings in more than three dimensions. It appears to suffer from the same perceptual issues as t-SNE; however, it reportedly preserves global structure better than t-SNE does (Coenen and Pearce 2019).
The tour is a visualisation technique that is grounded in mathematical theory, and allows the viewer to ascertain the shape and global structure of a dataset via inspection of the subspace generated by a set of low-dimensional projections (Asimov 1985; Buja and Asimov 1986).
As with other DR techniques, the tour assumes we have a real data matrix \(X\) consisting of \(n\) observations in \(p\) dimensions. First, the tour generates a sequence of \(p \times d\) orthonormal projection matrices (bases) \(A_{t \in \mathbb{N}}\), where \(d\) is typically \(1\) or \(2\). For each pair of successive orthonormal bases \(A_{t}\) and \(A_{t+1}\), the geodesic path between them is interpolated to form intermediate frames, giving the sense of continuous movement from one basis to another. The tour is then the continuous visualisation of the projections \(Y_{t} = XA_{t}\), that is, the projection of \(X\) onto \(A_{t}\) as the tour path is interpolated between successive bases. A grand tour corresponds to choosing new orthonormal bases at random, allowing a user to ascertain structure by exploring the subspace of \(d\)-dimensional projections. In practice, we first sphere our data via principal components to reduce the dimensionality of \(X\) prior to running the tour. Instead of picking projections at random, a guided tour can be used to generate a sequence of ‘interesting’ projections as quantified by an index function (Cook et al. 1995). While our software, liminal, is able to visualise guided tours, the case studies here use the grand tour to see global structure in the data.
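The core computations can be sketched as follows. This Python version is our own simplification of what tour software such as the tourr package implements (function names and toy data are ours): it draws a random orthonormal basis by QR decomposition, and interpolates along the geodesic between two bases via their principal angles.

```python
import numpy as np

def random_basis(p, d, rng):
    """A random p x d orthonormal basis from the QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.normal(size=(p, d)))
    return Q

def geodesic_frame(Fa, Fb, t):
    """Orthonormal frame a fraction t in [0, 1] along the geodesic from Fa to Fb."""
    Va, cos_tau, Vbh = np.linalg.svd(Fa.T @ Fb)
    tau = np.arccos(np.clip(cos_tau, -1.0, 1.0))   # principal angles between the planes
    Ga, Gb = Fa @ Va, Fb @ Vbh.T                   # paired principal directions
    W = Gb - Ga * np.cos(tau)                      # directions of rotation, orthogonal to Ga
    W = W / np.maximum(np.linalg.norm(W, axis=0), 1e-12)
    return Ga * np.cos(t * tau) + W * np.sin(t * tau)

rng = np.random.default_rng(2020)
X = rng.normal(size=(500, 10))            # n = 500 observations in p = 10 dimensions
Fa, Fb = random_basis(10, 2, rng), random_basis(10, 2, rng)
Y_mid = X @ geodesic_frame(Fa, Fb, 0.5)   # one interpolated 2-d projection
```

Every interpolated frame remains orthonormal, so each displayed view is a genuine linear projection of the data.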
Tours provide a supportive visualisation to NLDR graphics, and can be easily incorporated into an analyst's workflow with our software package, liminal. Our interface allows analysts to quickly compare views from embedding methods and see how an embedding method preserves or alters the geometry of their data. Using multiple concatenated and linked views with the tour enhances interaction techniques, and allows analysts to perform cluster orientation tasks via linked highlighting and brushing (McDonald 1982; Becker and Cleveland 1987). This approach allows our interface to achieve the three principles for interactive high-dimensional data visualisation outlined by Buja, Cook, and Swayne (1996): finding gestalt, posing queries, and making comparisons.
To investigate latent structure and the shape of a high dimensional dataset, a tour can be run without the use of an external embedding. It is often useful to first run principal components on the input as an initial dimension reduction step, and then tour a subset of those components instead, i.e. by selecting them from a scree plot. The default tour layout is a scatter plot, with an axis layout displaying the magnitude and direction of each basis vector. Since the tour is dynamic, it is often useful to be able to pause and highlight a particular view. In addition to pause, play and reset buttons, brushing will pause the tour path, allowing users to identify ‘interesting’ projections. The domain of the axis scales in a tour view is called the half range, and is computed by rescaling the input data onto a unit hypercube. We bind the half range to a mouse wheel event, allowing a user to pan and zoom on the tour view dynamically. This is useful for peeling back dense clumps of points to reveal structure.
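As a rough illustration of the half range, here is a numpy sketch; it is our own approximation of the idea, not the exact liminal computation. Because an orthonormal projection never increases a point's norm, the half range bounds the coordinates of every tour frame, so the axis limits can stay fixed across the animation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Rescale every variable onto [0, 1], putting the data inside the unit hypercube.
Xs = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
centred = Xs - Xs.mean(axis=0)

# Half range: the largest distance of any point from the centre.
half_range = np.max(np.linalg.norm(centred, axis=1))

# Any orthonormal 2-d projection stays within [-half_range, half_range].
Q, _ = np.linalg.qr(rng.normal(size=(10, 2)))
Y = centred @ Q
```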
We have combined the tour view in a side-by-side layout with a scatter plot view, as has been done in previous tour interfaces such as XGobi and DataViewer (Buja, Hurley, and McDonald 1986; Swayne, Cook, and Buja 1998). These views are linked: analysts can brush regions or highlight collections of points in either view. Linked highlighting can be performed when points have been previously labelled according to some discrete structure, i.e. when cluster labels are available. It is achieved by the analyst clicking on groups in the legend, which causes unselected groupings to have their points become less opaque. Consequently, simple linked highlighting can alleviate a known downfall of methods such as UMAP or t-SNE: that distances between clusters can be misleading. By highlighting corresponding clusters in the tour view, the analyst can see the relationship between clusters, and therefore obtain a more accurate representation of the topology of their data.
Simple linked brushing is achieved via mouse-click and drag movements. By default, when brushing occurs in the tour view, the current projection is paused and corresponding points in the embedding view are highlighted. Likewise, when brushing occurs in the embedding view, corresponding points in the tour view are highlighted. In this case, an analyst can use brushing for manually identifying clusters and verifying cluster locations and shapes: brushing in the embedding view gives analysts a sense of the shape and proximity of clusters in high-dimensional space.
As mentioned previously, when using any DR method we are assuming the embedding is representative of the high-dimensional dataset it was computed from. Defining what it means for an embedding to be ‘representative’ or ‘faithful’ to high-dimensional data is ill-posed, and depends on the underlying task an analyst is trying to achieve. At the very minimum, we are interested in distortions and diffusions of the high-dimensional data. Distortions occur when points that are near each other in the embedding view are far from each other in the original dataset; this implies that the embedding is not continuous. Diffusions occur when points that are far from each other in the embedding view are near each other in the original data. Whether points are near or far depends on the distance metric used; distortions and diffusions can be thought of in terms of the preservation of distances, or of the nearest neighbour graphs, between the high-dimensional space and the embedding space. As distances can be noisy in high dimensions, ranks can be used instead, as proposed by Lee and Verleysen (2009). Identifying distortions and diffusions allows an analyst to investigate the quality of their embedding and revise it iteratively.
These checks are done visually using our side-by-side tour and embedding views. In the simplest case, local continuity can be assessed via one-to-one linked brushing from the embedding view to the tour view. Similarly, diffusions are identified from linked brushing on the tour view, highlighting points in the embedding view.
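One way to quantify what such brushing checks reveal is a neighbourhood-preservation score. The following Python sketch is our own simplified metric based on \(k\)-nearest-neighbour overlap (not part of liminal, and cruder than the rank-based measures of Lee and Verleysen): a low score for a point flags a distortion or diffusion worth brushing.

```python
import numpy as np

def knn_indices(D, k):
    """Indices of the k nearest neighbours (excluding self) for each row of a distance matrix."""
    return np.argsort(D, axis=1)[:, 1:k + 1]

def neighbourhood_preservation(X, Y, k=10):
    """Mean fraction of each point's k nearest neighbours shared between X and its embedding Y."""
    DX = np.linalg.norm(X[:, None] - X[None], axis=-1)
    DY = np.linalg.norm(Y[:, None] - Y[None], axis=-1)
    nx, ny = knn_indices(DX, k), knn_indices(DY, k)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nx, ny)]
    return float(np.mean(overlap))

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
# An embedding that keeps the first two coordinates preserves some
# neighbourhood structure; an unrelated random embedding preserves almost none.
good = neighbourhood_preservation(X, X[:, :2])
bad = neighbourhood_preservation(X, rng.normal(size=(200, 2)))
```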
We have implemented the above design as an open source R package called liminal (Lee and Cook 2020). The package allows analysts to construct concatenated visualisations, drawn with the Vega-Lite grammar of interactive graphics via the vegawidget package (Satyanarayan et al. 2017; Lyttle and Vega/Vega-Lite Developers 2020). It provides an interface for constructing linked and stand alone interfaces for manipulating tour paths via the shiny and tourr packages (Chang et al. 2020; Wickham et al. 2011).
The process of generating successive bases and interpolating between them to construct intermediary frames means the tour is a dynamic visualisation technique. Generally, the user would set \(d=2\), and the tour is visualised as an animated scatter plot. This process of constructing bases and intermediate frames and visualising the resulting projection is akin to making a “flip book” animation. As with a flip book, an interface to the tour requires the ability to interact with and modify it in real time. The user interface generated in liminal allows a user to play, pause, and reset the tour animation; to pan and zoom, modifying the scales of the plot to provide context; and to highlight groups of points via click events, if a labelling variable has been placed on the legend.
These interactions are enabled by treating the basis generation as a reactive stream. Instead of realising the entire sequence, which would fix the animation to a predetermined number of frames, new bases and their intermediate frames are generated dynamically by pushing the current projection to the visualisation interface. The interface listens to events like pressing a button or mouse-dragging, and reacts by pausing the stream. This allows the user to manipulate the tour in real time rather than having to fix the number of bases ahead of time. Additionally, once the user has identified an interesting projection or is done with the tour, the interface will return the current basis for use downstream.
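The lazy-generation idea can be mimicked with a generator. In this Python sketch (an illustration of the design, not the actual shiny/vegawidget implementation), frames are produced only when the consumer pulls them, so pausing simply means not requesting the next frame; linear interpolation re-orthonormalised by QR stands in for the geodesic interpolation a real tour would use.

```python
import numpy as np

def tour_stream(p, d, steps=10, seed=0):
    """Lazily yield interpolated projection frames, one basis pair at a time.
    Frames are generated on demand rather than realised as a fixed-length
    sequence up front, mimicking a reactive stream."""
    rng = np.random.default_rng(seed)
    current, _ = np.linalg.qr(rng.normal(size=(p, d)))
    while True:
        target, _ = np.linalg.qr(rng.normal(size=(p, d)))
        for t in np.linspace(0.0, 1.0, steps, endpoint=False):
            # QR re-orthonormalisation of a linear blend: a simple stand-in
            # for geodesic interpolation, keeping every frame orthonormal.
            frame, _ = np.linalg.qr((1 - t) * current + t * target)
            yield frame
        current = target

stream = tour_stream(p=10, d=2)
frames = [next(stream) for _ in range(25)]   # the consumer pulls frames as needed
```

A user interface can pause the animation by suspending its pull loop, and resume by pulling again; the stream itself keeps no fixed length.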
The embedding and tour views are linked together via rectangular brushes; when a brush is active, points will be highlighted in the adjacent view. Because the tour is dynamic, brush events that become active will pause the animation, so that a user can interrogate the current view. By default, brushing on the embedding view will produce a one-to-one linking with the tour view. For interpreting specific combinations of clusters, the multiple guides on the legend can be selected in order to see their relative orientations. The interface is constructed as a shiny gadget specifically designed for interactive data analysis. Selections such as brushing regions and the current tour path are returned after the user clicks done on the interface and become available for further investigation.
The next section steps through case studies of our approach using simulations and an application to single cell RNA-seq data.
The first three case studies use simulations where the cluster structure and geometry of the underlying data is known. We start with a simple example where we generated spherical clusters that are embedded well by t-SNE. Then we move onto more complex examples where the tour provides insight, such as clusters that have substructure and where there is more complex geometry in the data.
In the final case study, we apply our approach to clustering the mouse retina data from Macosko et al. (2015), and apply the tour to the process of verifying marker genes that separate clusters.
We strongly recommend viewing the linked videos for each case study while reading. Links to the videos are available in table 1 and in the figures for each case study. The videos presented show the visual appearance of the liminal interface, and how we can interact with the tour via the controls previously described. If you are unable to view the videos, the figures in each case study consist of screenshots that summarise what is learned from combining the tour and an embedding view.
To begin, we look at simulated datasets that reproduce known facts about the t-SNE algorithm. Our first dataset consists of five spherical 5-\(d\) Gaussian clusters embedded in 10-\(d\) space; each cluster has the same covariance matrix. We then computed a t-SNE layout with default settings using the Rtsne package (Krijthe 2015), and set up the liminal linked interface with a grand tour on the 10-\(d\) observations.
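A simplified stand-in for this simulation can be written as follows. Note this Python sketch is our own: where the paper embeds 5-\(d\) clusters in 10-\(d\) space, for brevity we draw full 10-\(d\) spherical Gaussians with identity covariance and well separated means.

```python
import numpy as np

rng = np.random.default_rng(5)
n_per, p, k = 100, 10, 5

# Five spherical Gaussian clusters in a 10-d space: well separated means,
# identity covariance for every cluster.
means = rng.normal(scale=10.0, size=(k, p))
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(n_per, p)) for m in means])
labels = np.repeat(np.arange(k), n_per)
```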
Figure 1: Screenshots of the liminal interface applied to well clustered data, a video of the tour animation is available at https://player.vimeo.com/video/439635921.
From the video linked in figure 1, we learn that t-SNE has correctly split out each cluster and laid the clusters out in a star-like formation. This agrees with the tour view: once we start the animation, the five clusters begin to appear, but points are generally more concentrated in the projection view compared to the t-SNE layout (figure 1a). This can be seen by brushing the t-SNE view (figure 1b).
Next we view Gaussian clusters from the Multi Challenge Dataset, a benchmark simulation data set for clustering tasks (Rauber 2009). This dataset has two Gaussian clusters with equal covariance embedded in 10-\(d\), and a third cluster with hierarchical structure. This cluster has two 3-\(d\) clusters embedded in 10-\(d\), where the second cluster is subdivided into three smaller clusters that are each equidistant from each other and have the same covariance structure. From the video linked in figure 2, we see that t-SNE has correctly identified the sub-clusters. However, their locations relative to each other are distorted, with the orange and blue groups being far from each other in the tour view (figure 2a). In this case it is difficult to see the sub-clusters in the tour view; however, once we zoom and highlight, they become more apparent (figure 2b). When we brush the sub-clusters in the t-SNE view, their relative placement is again exaggerated, with the tour showing that they are indeed much closer than the impression the t-SNE view gives.
Figure 2: Screenshots of the liminal interface applied to sub-clustered data, a video of the tour animation is available at https://player.vimeo.com/video/43963590.
Next we explore some simulated noisy tree-structured data (figure 3). Our interest here is in how t-SNE visualisations break the topology of the data, and in seeing whether we can resolve this by tweaking the default parameters with reference to the global view of the dataset. This simulation aims to mimic branching trajectories of cell differentiation: if there were only mature cells, we would just see the tips of the branches, which have a hierarchical pattern of clustering.
Figure 3: Example high-dimensional tree shaped data, \(n = 3000\) and \(p = 100\). (a) The true data lies on 2-\(d\) tree consisting of ten branches. This data is available in the phateR package and is simulated via diffusion-limited aggregation (a random walk along the branches of the tree) with Gaussian noise added (Moon et al. 2019). (b) The first two principal components, which form the initial projection for the tour, note that the backbone of the tree is obscured by this view. (c) The default t-SNE view breaks the global structure of the tree. (d) Altering t-SNE using the first two principal components as the starting coordinates for the embedding, results in clustering the tree at its branching points.
First, we apply principal components and restrict the results to the first twelve principal components (which account for approximately 70% of the variance in the data) to use with the grand tour.
Next, we run t-SNE using the default arguments on the complete data (this keeps the first 50 PCs, sets the perplexity to 30, and performs random initialisation). We then create a linked tour with the t-SNE layout using liminal, as shown in figure 4.
Figure 4: Screenshots of the liminal interface applied to tree structured data, a video of the tour animation is available at https://player.vimeo.com/video/439635892.
From the linked video, we see that the t-SNE view has been unable to recapitulate the topology of the tree: the backbone (blue) branch has been split into three fragments (figure 4a). We can see this immediately via the linked highlighting over both plots. If we click on the legend for the zero branch, the blue coloured points in each view are highlighted and the remaining points are made transparent. From here it becomes apparent from the tour view that the blue branch forms the backbone of the tree and is connected to all other branches. From the video it is easy to see that cluster sizes formed via t-SNE can be misleading; in the tour view there is a lot of noise along the branches, while this does not appear to be the case for the t-SNE result (figure 4b).
Based on the first view, we modify the inputs to t-SNE to try to achieve a better trade-off between preserving local structure and retaining the topology of the data. We keep every parameter the same, except that we initialise \(Y\) with the first two PCs (scaled to have standard deviation 1e-4) instead of the default random initialisation, and increase the perplexity from 30 to 100. We then combine these results with our tour view, as displayed in the video linked in the caption of figure 5.
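The PCA initialisation step can be sketched in Python as follows; the toy data is ours, and the rescaling convention (scaling so that the first component has standard deviation 1e-4) follows Kobak and Berens.

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(300, 50))

# First two principal component scores via SVD of the centred data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U[:, :2] * S[:2]

# Shrink the scores so the first coordinate has standard deviation 1e-4.
# A near-zero starting configuration that preserves the global PCA layout
# keeps the early t-SNE optimisation stable.
Y_init = pcs / pcs[:, 0].std() * 1e-4
```

`Y_init` would then be passed to the t-SNE routine as the starting coordinates in place of the default random initialisation.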
Figure 5: Screenshots of the liminal interface applied to tree structured data, a video of the tour animation is available at https://player.vimeo.com/video/439635863.
The video linked in figure 5 shows that this selection of parameters results in the tips of the branches (the three black dots in figure 3a) being split into three clusters representing the terminal branches of the tree. However, there are perceptual issues with the placement of the three groupings in the t-SNE view that become apparent via simple linked brushing. If we brush the tips of the yellow and brown branches (which appear to be close to each other in the t-SNE view), we immediately see that the placement is distorted in the t-SNE view; in the tour view these tips are at opposite ends of the tree (figure 5b). Although this is a known issue of the t-SNE algorithm, we can easily identify it via simple interactivity, without knowing the inner workings of the method.
A common analysis task in single cell studies is performing clustering to identify groupings of cells with similar expression profiles. Analysts in this area generally use non-linear DR methods for verification and identification of clusters and developmental trajectories (i.e., case study 1). For clustering workflows, the primary task is to verify the existence of clusters and then begin to identify the clusters as cell types using the expression of “known” marker genes. Here a ‘faithful’ embedding should ideally preserve the topology of the data: cells that correspond to a cell type should lie in the same neighbourhood in high-dimensional space. In this case study we use our linked brushing approaches to look at neighbourhood preservation, and look at marker genes through the lens of the tour. The data we have selected for this case study has features similar to those found in case studies 2 and 3.
First, we downloaded the raw mouse retinal single cell RNA-seq data from Macosko et al. (2015) using the scRNAseq Bioconductor package (Risso and Cole 2019). We followed a standard workflow for pre-processing and normalising this data (described by Amezquita et al. (2020)): we performed QC using the scater package by removing cells with a high proportion of mitochondrial gene expression and low numbers of detected genes, we log-transformed and normalised the expression values, and finally selected highly variable genes (HVGs) using scran (McCarthy et al. 2017; Lun, McCarthy, and Marioni 2016). The top ten percent of HVGs were used to subset the normalised expression matrix and compute PCA, retaining the first 25 components. Using the PCs, we built a shared nearest-neighbours graph (with \(k = 10\)) and used Louvain clustering to generate clusters (Blondel et al. 2008).
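The steps above can be sketched with the Bioconductor packages named in the text. The function names are real, but the QC thresholds and arguments here are illustrative assumptions, not the exact values we used.

```r
# Sketch of the pre-processing workflow (illustrative thresholds only).
library(scRNAseq)  # MacoskoRetinaData()
library(scater)    # QC metrics, normalisation, PCA
library(scran)     # HVG selection, SNN graph

sce <- MacoskoRetinaData()

# QC: drop cells with high mitochondrial content or few detected genes
is_mito <- grepl("^mt-", rownames(sce))
qc   <- perCellQCMetrics(sce, subsets = list(Mito = which(is_mito)))
keep <- !isOutlier(qc$subsets_Mito_percent, type = "higher") &
        !isOutlier(qc$detected, type = "lower", log = TRUE)
sce  <- sce[, keep]

# log-transform and normalise expression values
sce <- logNormCounts(sce)

# top 10% of HVGs, then PCA retaining 25 components
dec  <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, prop = 0.1)
sce  <- runPCA(sce, subset_row = hvgs, ncomponents = 25)

# shared nearest-neighbours graph (k = 10) with Louvain clustering
g        <- buildSNNGraph(sce, k = 10, use.dimred = "PCA")
clusters <- igraph::cluster_louvain(g)$membership
```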
To check and verify the clustering, we construct a liminal view. We tour the first five PCs (approximately 20% of the variance in expression) alongside the t-SNE view, which was computed from all 25 PCs. Due to the latency of the interface, we take a weighted sample of the rows based on cluster membership, leaving us with approximately ten percent of the original data, 4,590 cells. Although this is not ideal, it still allows us to get a sense of the shape of the clusters, as seen in the linked video in figure 6.
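One way to implement such a weighted sample in base R is to weight each row inversely to its cluster's size, so that small clusters survive the subsampling. This is a sketch under that assumption, with simulated labels in place of the real cluster assignments:

```r
# Sketch: subsample ~10% of cells, weighting rows by cluster membership
# so that small clusters are not lost. Labels here are simulated.
set.seed(2020)
cluster <- factor(sample(paste0("c", 1:12), 45000, replace = TRUE,
                         prob = rexp(12)))

# weight each row inversely to its cluster's size, giving every
# cluster roughly equal representation in the subsample
w    <- 1 / table(cluster)[cluster]
keep <- sample(seq_along(cluster), size = 4590, prob = w)

table(cluster[keep])  # all clusters retained in the subsample
```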
Figure 6: Screenshots of the liminal interface applied to single cell data, a video of the tour animation is available at https://player.vimeo.com/video/439635812.
From the video linked in figure 6, we learn that the embedding has mostly placed the clusters relative to each other in a way consistent with their locations in high-dimensional space, with the notable exception of points in clusters 3 and 10, as shown with linked brushing (figure 6a). As expected, t-SNE mitigates the crowding problem that affects the tour in this case: points in clusters 2, 4, 6, and 11 are clumped together in the tour view but blown up in the embedding view (figure 6b). The tour appears to form a tetrahedron-like shape, with points lying on the surface and along the vertices of the tetrahedron in 5-\(d\) PCA space, a phenomenon that has also been observed by Korem et al. (2015) (figure 6c). Brushing on the tour view reveals that points in cluster 9 are diffuse in the embedding view: in the tour view they are relatively far from, and spread apart from, the other clusters, but in the embedding they are placed among clusters 3 and 9 (figure 6d).
Next, we identify marker genes for clusters using one-sided Welch t-tests with a minimum log fold change of one, as recommended by Amezquita et al. (2020), which uses the testing framework from McCarthy and Smyth (2009). We select the top 10 marker genes that are upregulated in cluster 2, which was one of the clumped clusters when we toured on principal components. Here the tour becomes an alternative to a standard heatmap view for assessing shared markers: the basis generation (shown as the biplot on the left view) reflects the relative weighting of each gene. We run the tour directly on the log-normalised expression values using the same subset as before.
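This test setup corresponds to the interface of scran's `findMarkers` function. The sketch below assumes a normalised SingleCellExperiment `sce` and a vector of `clusters` from the workflow described earlier; it is illustrative, not the exact analysis code.

```r
# Sketch: one-sided Welch t-tests with a log-fold-change threshold of 1
# (the TREAT-style test of McCarthy and Smyth 2009), via scran.
library(scran)

markers <- findMarkers(sce, groups = clusters,
                       test.type = "t",   # Welch t-tests
                       direction = "up",  # one-sided, upregulated only
                       lfc = 1)           # minimum log fold change

# top 10 upregulated marker genes for cluster 2
top10 <- head(rownames(markers[["2"]]), 10)
```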
Figure 7: Screenshots of the liminal tour applied to a marker gene set, a video of the tour animation is available at https://player.vimeo.com/video/439635843.
From the video linked in figure 7, we see that the expression of the marker genes appears to separate the previously clumped clusters 2, 4, 6, and 11 from the other clusters, indicating that these genes are expressed in all four clusters (figure 7a). After zooming, we can see a trajectory forming along the clusters, while the axis view shows that the magnitude of expression of the marker genes is similar across these separated clusters, which is consistent with the results of the marker gene analysis (figure 7b).
We have shown that the use of tours as a tool for interacting with high-dimensional data provides additional insight when interrogating views generated from embeddings. The interface we have designed in the liminal package allows a user to gain a deeper understanding of an embedding algorithm, and rectifies perceptual issues associated with NLDR methods via linked interactions with the tour. As we have shown in the simulation case studies, the t-SNE method can produce misleading embeddings, which can be detected through the use of linked brushing and highlighting. In the case when the data has a piecewise linear geometry, like the tree simulation, the tour preserves the shape of the data, which can be obscured by the embedding method.
Our framework can also be useful in practice, as displayed in the fourth case study. The tour, when combined with t-SNE, allowed us to identify clusters while giving us an idea of their orientation relative to each other. Moreover, we could visually inspect the separation of clusters using a tour on marker gene sets. We see our approach as being valuable to the single cell analyst who wants to make their embeddings more interpretable.
We have shown in the case studies that one-to-one linked brushing can be used to identify distortions in the embedding; however, we would like to extend this to one-to-many linked brushing, which would allow us to directly interrogate neighbourhood preservation. This form of brushing acts directly on a \(k\)-nearest-neighbours (\(k\)-nn) graph computed from a reference dataset: when a user brushes over a region in the embedding, all the points that match the graph’s edges are selected on the corresponding tour view. The reference dataset for computing nearest neighbours (for example a distance matrix, or the complete data matrix) can be independent of the tour or embedding views. In place of highlighting, one could use opacity or binned colour scales to encode the distances or ranks of the neighbouring points. We have begun implementing this brush in liminal, using the FNN or RcppAnnoy packages for fast neighbourhood estimation on the server side; however, there are still technicalities that need to be resolved (Beygelzimer et al. 2019; Eddelbuettel 2020). Brush composition, such as ‘and’, ‘or’, or ‘not’ brushes, could be used to further investigate mismatches between the \(k\)-nn graphs estimated from the embedding and tour views.
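A minimal sketch of how such a brush could propagate a selection, using the FNN package mentioned above. The reference matrix, the brushed indices, and the selection rule are all illustrative assumptions, not the liminal implementation:

```r
# Sketch: one-to-many brushing via a precomputed k-nn graph.
library(FNN)

set.seed(1)
reference <- matrix(rnorm(500 * 10), ncol = 10)  # stand-in reference data

# precompute the k-nearest-neighbour index matrix on the reference data
k        <- 10
nn_index <- get.knn(reference, k = k)$nn.index   # 500 x 10 matrix

# suppose the user brushes rows 1:5 in the embedding view; the tour view
# then highlights those rows plus all of their reference-space neighbours
brushed  <- 1:5
selected <- union(brushed, as.vector(nn_index[brushed, ]))
```

Because the graph is precomputed, the per-brush cost is just an index lookup, which matters for keeping the linked views responsive.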
There are some limitations in using the liminal interface for larger datasets. First, t-SNE avoids the crowding problem by separating points into distinct regions of the display canvas, while in the tour, points are concentrated in the centre of the projection and become difficult to see. Second, as \(n\) increases, the animation becomes slower as there is more data passing from the server to the client. Both of these issues could be mitigated through alternate tour paths and displays. The recent work by Laa et al. (2020) is a promising area for further investigation, as is work from topological statistics (Rieck 2017; Genovese et al. 2017).
We would like to thank Dr David Frazier, Dr Ursula Laa and Dr Paul Harrison for their feedback on this work. The screenshots and images were compiled using the cowplot and ggplot2 packages (Wilke 2019; Wickham 2016).
Code, data, and video for reproducing this paper are available at https://github.com/sa-lee/paper-liminal. Direct links to videos for viewing online are available in table 1.
| Case Study | Example | URL |
|---|---|---|
| 1 | gaussian | https://player.vimeo.com/video/439635921 |
| 2 | hierarchical | https://player.vimeo.com/video/439635905 |
| 3 | trees-01 | https://player.vimeo.com/video/439635892 |
| 3 | trees-02 | https://player.vimeo.com/video/439635863 |
| 4 | mouse-01 | https://player.vimeo.com/video/439635812 |
| 4 | mouse-02 | https://player.vimeo.com/video/439635843 |
Amezquita, Robert A, Aaron T L Lun, Etienne Becht, Vince J Carey, Lindsay N Carpp, Ludwig Geistlinger, Federico Marini, et al. 2020. “Orchestrating Single-Cell Analysis with Bioconductor.” Nat. Methods 17 (2): 137–45. https://doi.org/10.1038/s41592-019-0654-x.
Asimov, Daniel. 1985. “The Grand Tour: A Tool for Viewing Multidimensional Data.” SIAM J. Sci. And Stat. Comput. 6 (1): 128–43. https://doi.org/10.1137/0906011.
Becker, Richard A, and William S Cleveland. 1987. “Brushing Scatterplots.” Technometrics 29 (2): 127–42. https://doi.org/10.2307/1269768.
Beygelzimer, Alina, Sham Kakadet, John Langford, Sunil Arya, David Mount, and Shengqiao Li. 2019. FNN: Fast Nearest Neighbor Search Algorithms and Applications. https://CRAN.R-project.org/package=FNN.
Blondel, Vincent D, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. “Fast Unfolding of Communities in Large Networks.” J. Stat. Mech. 2008 (10): P10008. https://doi.org/10.1088/1742-5468/2008/10/P10008.
Brehmer, Matthew, Michael Sedlmair, Stephen Ingram, and Tamara Munzner. 2014. “Visualizing Dimensionally-Reduced Data: Interviews with Analysts and a Characterization of Task Sequences.” In Proceedings of the Fifth Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization, 1–8. dl.acm.org. https://dl.acm.org/doi/abs/10.1145/2669557.2669559.
Buja, Andreas, and Daniel Asimov. 1986. “Grand Tour Methods: An Outline.” In Proceedings of the Seventeenth Symposium on the Interface of Computer Sciences and Statistics on Computer Science and Statistics, 63–67. USA: Elsevier North-Holland, Inc. https://dl.acm.org/doi/10.5555/26036.26046.
Buja, Andreas, Dianne Cook, and Deborah F Swayne. 1996. “Interactive High-Dimensional Data Visualization.” J. Comput. Graph. Stat. 5 (1): 78–99. https://doi.org/10.2307/1390754.
Buja, Andreas, Catherine Hurley, and John McDonald. 1986. “A Data Viewer for Multivariate Data.” In Computing Science and Statistics: Proceedings of the 18th Symposium on the Interface, 171–74. Washington: American Statistical Association.
Buja, Andreas, Deborah F Swayne, Michael L Littman, Nathaniel Dean, Heike Hofmann, and Lisha Chen. 2008. “Data Visualization with Multidimensional Scaling.” J. Comput. Graph. Stat. 17 (2): 444–72. https://doi.org/10.1198/106186008X318440.
Chang, Winston, Joe Cheng, J J Allaire, Yihui Xie, and Jonathan McPherson. 2020. “Shiny: Web Application Framework for R.” https://CRAN.R-project.org/package=shiny.
Coenen, Andy, and Adam Pearce. 2019. “Understanding UMAP.” https://pair-code.github.io/understanding-umap/. https://pair-code.github.io/understanding-umap/.
Coifman, R R, S Lafon, A B Lee, M Maggioni, B Nadler, F Warner, and S W Zucker. 2005. “Geometric Diffusions as a Tool for Harmonic Analysis and Structure Definition of Data: Diffusion Maps.” Proc. Natl. Acad. Sci. U. S. A. 102 (21): 7426–31. https://doi.org/10.1073/pnas.0500334102.
Cook, Dianne, Andreas Buja, Javier Cabrera, and Catherine Hurley. 1995. “Grand Tour and Projection Pursuit.” J. Comput. Graph. Stat. 4 (3): 155–72. https://doi.org/10.1080/10618600.1995.10474674.
Eddelbuettel, Dirk. 2020. RcppAnnoy: ’Rcpp’ Bindings for ’Annoy’, a Library for Approximate Nearest Neighbors. https://CRAN.R-project.org/package=RcppAnnoy.
Genovese, Christopher, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. 2017. “Finding Singular Features.” J. Comput. Graph. Stat. 26 (3): 598–609. https://doi.org/10.1080/10618600.2016.1260472.
Kobak, Dmitry, and Philipp Berens. 2019. “The Art of Using t-SNE for Single-Cell Transcriptomics.” Nat. Commun. 10 (1): 5416. https://doi.org/10.1038/s41467-019-13056-x.
Korem, Yael, Pablo Szekely, Yuval Hart, Hila Sheftel, Jean Hausser, Avi Mayo, Michael E Rothenberg, Tomer Kalisky, and Uri Alon. 2015. “Geometry of the Gene Expression Space of Individual Cells.” PLoS Comput. Biol. 11 (7): e1004224. https://doi.org/10.1371/journal.pcbi.1004224.
Krijthe, Jesse H. 2015. Rtsne: T-Distributed Stochastic Neighbor Embedding Using Barnes-Hut Implementation. https://github.com/jkrijthe/Rtsne.
Kruskal, J B. 1964a. “Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis.” Psychometrika 29 (1): 1–27. https://doi.org/10.1007/BF02289565.
———. 1964b. “Nonmetric Multidimensional Scaling: A Numerical Method.” Psychometrika 29 (2): 115–29. https://doi.org/10.1007/BF02289694.
Laa, Ursula, Dianne Cook, Andreas Buja, and German Valencia. 2020. “Hole or Grain? A Section Pursuit Index for Finding Hidden Structure in Multiple Dimensions,” April. http://arxiv.org/abs/2004.13327.
Lee, John A, and Michel Verleysen. 2009. “Quality Assessment of Dimensionality Reduction: Rank-Based Criteria.” Neurocomputing 72 (7): 1431–43. https://doi.org/10.1016/j.neucom.2008.12.017.
Lee, Stuart, and Dianne Cook. 2020. Liminal: Multivariate Data Visualization with Tours and Embeddings. https://github.com/sa-lee/liminal.
Lewis, Joshua, Laurens Van der Maaten, and Virginia de Sa. 2012. “A Behavioral Investigation of Dimensionality Reduction.” In Proceedings of the Annual Meeting of the Cognitive Science Society. Vol. 34.
Linderman, George C, and Stefan Steinerberger. 2019. “Clustering with t-SNE, Provably.” SIAM Journal on Mathematics of Data Science 1 (2): 313–32. https://doi.org/10.1137/18M1216134.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell Rna-Seq Data with Bioconductor.” F1000Res. 5: 2122. https://doi.org/10.12688/f1000research.9501.2.
Lyttle, Ian, and Vega/Vega-Lite Developers. 2020. “Vegawidget: ’Htmlwidget’ for ’Vega’ and ’Vega-Lite’.” https://CRAN.R-project.org/package=vegawidget.
Maaten, Laurens van der. 2014. “Accelerating t-SNE Using Tree-Based Algorithms.” J. Mach. Learn. Res. http://www.jmlr.org/papers/volume15/vandermaaten14a/vandermaaten14a.pdf.
Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing Data Using t-SNE.” J. Mach. Learn. Res. 9 (Nov): 2579–2605. http://www.jmlr.org/papers/v9/vandermaaten08a.html.
Macosko, Evan Z, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, et al. 2015. “Highly Parallel Genome-Wide Expression Profiling of Individual Cells Using Nanoliter Droplets.” Cell 161 (5): 1202–14. https://doi.org/10.1016/j.cell.2015.05.002.
McCarthy, Davis J., Kieran R. Campbell, Aaron T. L. Lun, and Quin F. Wills. 2017. “Scater: Pre-Processing, Quality Control, Normalisation and Visualisation of Single-Cell RNA-Seq Data in R.” Bioinformatics 33 (8): 1179–86. https://doi.org/10.1093/bioinformatics/btw777.
McCarthy, Davis J, and Gordon K Smyth. 2009. “Testing Significance Relative to a Fold-Change Threshold Is a TREAT.” Bioinformatics 25 (6): 765–71. https://doi.org/10.1093/bioinformatics/btp053.
McDonald, J A. 1982. “Interactive Graphics for Data Analysis.” PhD thesis, Stanford University. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.331.6673&rep=rep1&type=pdf.
McInnes, Leland, John Healy, and James Melville. 2018. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” February. http://arxiv.org/abs/1802.03426.
Melville, James. 2020. “t-SNE Initialization Options.” https://jlmelville.github.io/smallvis/init.html. https://jlmelville.github.io/smallvis/init.html.
Moon, Kevin R, David van Dijk, Zheng Wang, Scott Gigante, Daniel B Burkhardt, William S Chen, Kristina Yim, et al. 2019. “Visualizing Structure and Transitions for Biological Data Exploration.” bioRxiv. https://doi.org/10.1101/120378.
Nguyen, Lan Huong, and Susan Holmes. 2019. “Ten Quick Tips for Effective Dimensionality Reduction.” PLoS Comput. Biol. 15 (6): e1006907. https://doi.org/10.1371/journal.pcbi.1006907.
Ovchinnikova, Svetlana, and Simon Anders. 2020. “Exploring Dimension-Reduced Embeddings with Sleepwalk.” Genome Res. 30 (5): 749–56. https://doi.org/10.1101/gr.251447.119.
Pezzotti, Nicola, Boudewijn P F Lelieveldt, Laurens Van Der Maaten, Thomas Hollt, Elmar Eisemann, and Anna Vilanova. 2017. “Approximated and User Steerable tSNE for Progressive Visual Analytics.” IEEE Trans. Vis. Comput. Graph. 23 (7): 1739–52. https://doi.org/10.1109/TVCG.2016.2570755.
Rauber, Andreas. 2009. “Multi-Challenge Data Set.” http://ifs.tuwien.ac.at/dm/dataSets.html. http://ifs.tuwien.ac.at/dm/dataSets.html.
Rieck, Bastian. 2017. “Persistent Homology in Multivariate Data Visualization.” Edited by Heike Leitte and Michael Gertz. PhD thesis, Ruprecht-Karls-Universität Heidelberg.
Risso, Davide, and Michael Cole. 2019. ScRNAseq: Collection of Public Single-Cell Rna-Seq Datasets.
Satyanarayan, Arvind, Dominik Moritz, Kanit Wongsuphasawat, and Jeffrey Heer. 2017. “Vega-Lite: A Grammar of Interactive Graphics.” IEEE Trans. Vis. Comput. Graph. 23 (1): 341–50. https://doi.org/10.1109/TVCG.2016.2599030.
Sedlmair, Michael, Tamara Munzner, and Melanie Tory. 2013. “Empirical Guidance on Scatterplot and Dimension Reduction Technique Choices.” IEEE Trans. Vis. Comput. Graph. 19 (12): 2634–43. https://doi.org/10.1109/TVCG.2013.153.
Silva, V D, and J B Tenenbaum. 2003. “Global Versus Local Methods in Nonlinear Dimensionality Reduction.” Adv. Neural Inf. Process. Syst. http://papers.nips.cc/paper/2141-global-versus-local-methods-in-nonlinear-dimensionality-reduction.pdf.
Smilkov, Daniel, Nikhil Thorat, Charles Nicholson, Emily Reif, Fernanda B Viégas, and Martin Wattenberg. 2016. “Embedding Projector: Interactive Visualization and Interpretation of Embeddings,” November. http://arxiv.org/abs/1611.05469.
Swayne, Deborah F, and Andreas Buja. 2004. “Exploratory Visual Analysis of Graphs in GGOBI.” In COMPSTAT 2004 — Proceedings in Computational Statistics, 477–88. Physica-Verlag HD. https://doi.org/10.1007/978-3-7908-2656-2\_39.
Swayne, Deborah F, Dianne Cook, and Andreas Buja. 1998. “XGobi: Interactive Dynamic Data Visualization in the X Window System.” J. Comput. Graph. Stat. 7 (1): 113–30. https://doi.org/10.1080/10618600.1998.10474764.
Swayne, Deborah F, Duncan Temple Lang, Andreas Buja, and Dianne Cook. 2003. “GGobi: Evolving from XGobi into an Extensible Framework for Interactive Data Visualization.” Comput. Stat. Data Anal. 43 (4): 423–44. https://doi.org/10.1016/S0167-9473(02)00286-4.
Tang, Jian, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. 2016. “Visualizing Large-Scale and High-Dimensional Data,” February. http://arxiv.org/abs/1602.00370.
Torgerson, Warren S. 1952. “Multidimensional Scaling: I. Theory and Method.” Psychometrika 17 (4): 401–19. https://doi.org/10.1007/BF02288916.
Wattenberg, Martin, Fernanda Viégas, and Ian Johnson. 2016. “How to Use t-SNE Effectively.” Distill 1 (10). https://doi.org/10.23915/distill.00002.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Use R! Springer International Publishing. https://doi.org/10.1007/978-3-319-24277-4.
Wickham, Hadley, Dianne Cook, and Heike Hofmann. 2015. “Visualizing Statistical Models: Removing the Blindfold.” Statistical Analysis and Data Mining: The ASA Data Science Journal 8 (4): 203–25. https://doi.org/10.1002/sam.11271.
Wickham, Hadley, Dianne Cook, Heike Hofmann, and Andreas Buja. 2011. “Tourr: An R Package for Exploring Multivariate Data with Projections.” Journal of Statistical Software, Articles 40 (2): 1–18. https://doi.org/10.18637/jss.v040.i02.
Wilke, Claus O. 2019. Cowplot: Streamlined Plot Theme and Plot Annotations for ’Ggplot2’. https://CRAN.R-project.org/package=cowplot.